Balancing between over-weighting and under-weighting in supervised term weighting
Similar resources
Balancing between over-weighting and under-weighting in supervised term weighting
Supervised term weighting could improve the performance of text categorization. A way proven to be effective is to give more weight to terms with more imbalanced distributions across categories. This paper shows that supervised term weighting should not just assign large weights to imbalanced terms, but should also control the trade-off between over-weighting and under-weighting. Overweighting,...
Reducing Over-Weighting in Supervised Term Weighting for Sentiment Analysis
Recently the research on supervised term weighting has attracted growing attention in the field of Traditional Text Categorization (TTC) and Sentiment Analysis (SA). Despite their impressive achievements, we show that existing methods more or less suffer from the problem of over-weighting. Overlooked by prior studies, over-weighting is a new concept proposed in this paper. To address this probl...
Supervised Term Weighting Methods for URL Classification
Many term weighting methods have been suggested in the literature for Information Retrieval and Text Categorization. Term weighting, a part of the feature selection process, has not yet been explored for the URL classification problem. We classify a web page using its URL alone without fetching its content, and hence URL-based classification is faster than other methods. In this study, we investigate the use ...
Empirical Term Weighting
Our system used an empirical method for estimating term weights directly from relevance judgements, avoiding various standard but potentially troublesome assumptions. It is common to assume, for example, that weights vary with term frequency (tf) and inverse document frequency (idf) in a particular way, e.g., tf · idf, but the fact that there are so many variants of this formula in the literature suggests...
Relevance Weighting Using Distance Between Term Occurrences
Recent work has achieved promising retrieval performance using distance between term occurrences as a primary estimator of document relevance. A major benefit of this approach is that relevance scoring does not rely on collection frequency statistics. A theoretical framework for lexical spans is now proposed which encompasses these approaches and suggests a number of important directions for fut...
Journal
Journal title: Information Processing & Management
Year: 2017
ISSN: 0306-4573
DOI: 10.1016/j.ipm.2016.10.003